Abstract

In  this  paper  we  examine  the  packet  routing  problem  in 
a  network  independent  context.  Our  goal  is  to  devise  a 
strategy  for  routing  that  works  well  for  a  wide  variety  of 
networks.  To  achieve  this  goal,  we  partition  the  routing 
problem  into  two  stages:  a  path  selection  stage  and  a 
scheduling  stage.  In  the  first  stage  we  find  paths  for 
the  packets  with  small  maximum  distance,  d,  and  small 
maximum  congestion,  c.  Once  the  paths  are  fixed,  both 
are  lower  bounds  on  the  time  required  to  deliver  the 
packets.  In  the  second  stage  we  find  a  schedule  for 
the  movement  of  each  packet  along  its  path  so  that  no 
two  packets  traverse  the  same  edge  at  the  same  time, 
and  so  that  the  total  time  and  maximum  queue  size 
required  to  route  all  of  the  packets  to  their  destinations 
are  minimized.  For  many  graphs,  the  first  stage  is  easy 
-  we  simply  use  randomized  intermediate  destinations 
as  suggested  by  Valiant.  The  second  stage  is  more 
challenging,  however,  and  is  the  focus  of  this  paper. 
Our  results  include: 

1. a proof that there is a schedule of length $O(c + d)$
requiring only constant size queues for any set of
paths with distance $d$ and congestion $c$,

2. a randomized on-line algorithm for routing any set
of $N$ "leveled" paths on a bounded-degree network
in $O(c + d + \log N)$ steps using constant size queues,

3. the first on-line algorithm for routing $N$ packets
in the $N$-node shuffle-exchange graph in $O(\log N)$
steps using constant size queues, and

4. the first constructions of area and volume-universal
networks requiring only $O(\log N)$ slowdown.
1  Introduction

1.1  Background 

The task of designing an efficient packet routing algorithm
is central to the design of most large-scale general
purpose parallel computers. In fact, even the basic unit
of time in some parallel machines is measured in terms
of how fast the packet router operates. For example,
the speed of an algorithm in the Connection Machine
is often measured in terms of routing cycles (roughly
the time to route a random permutation) or petit cycles
(the time to perform an atomic step of the routing
algorithm). Similarly, the performance of machines like
the BBN Butterfly is substantially influenced by the
speed and rate of successful delivery of its router.

Packet routing also provides an important bridge between
theoretical computer science and applied computer
science; it is through packet routing that a real
machine such as the Connection Machine is able to
simulate an idealized machine such as the CRCW PRAM.
More generally, getting the right data to the right place
at the right time is an important, interesting, and
challenging problem. Not surprisingly, it has also been the
subject of a great deal of research.

1.2  Past  work 

The first major result in packet routing is due to
Batcher [3], who devised an elegant and practical
algorithm for routing any permutation of $N$ packets on
an $N$-processor shuffle-exchange graph in $\log^2 N$ steps.
The result extends to routing many-one problems
provided that (as is typically assumed) combining can be
used to merge packets that have a common destination.

No better deterministic algorithm was found until
Ajtai, Komlós, and Szemerédi [1] solved a classic
open problem by constructing an $O(\log N)$-depth sorting
network. Leighton [11] then used this $O(N \log N)$-node
network to construct a degree 3 $N$-node network
capable of solving any $N$-packet routing problem in
$O(\log N)$ steps. Although this result is optimal up to
constant factors, the constant factors are quite large
and the algorithm is of no practical use. Hence, the
effort to find fast deterministic algorithms has continued.
Thus far, the best small-constant-factor deterministic
algorithm is an $O(\log^2 N / \log\log N)$-step algorithm for
routing on the butterfly.

There has been comparatively much greater success
in the development of efficient randomized packet routing
algorithms. The study of randomized algorithms
was pioneered by Valiant and Brebner [25], who showed
how to route any permutation of $N$ packets in $O(\log N)$
steps on an $N$-node hypercube with queues of size
$O(\log N)$ at each node. Although the algorithm was
not always guaranteed to work, it was guaranteed to
work with probability at least $1 - 1/N$ for any
permutation. This result was improved in a succession of
fundamental papers by Aleliunas [2], Upfal [24], Pippenger
[17], and Ranade [18]. Aleliunas and Upfal developed
the notion of a delay path and showed how to route
on the shuffle-exchange and butterfly graphs
(respectively) in $O(\log N)$ steps with queues of size $O(\log N)$.
Pippenger was the first to eliminate the need for large
queues, and showed how to route on a variant of the
butterfly in $O(\log N)$ steps with queues of size $O(1)$.
Ranade showed how combining could be used to extend
the Pippenger result to include many-one routing
problems, and tremendously simplified the analysis
required to prove such a result. As a consequence of
Ranade's work, it has finally become possible to
simulate a step of an $N$-processor CRCW PRAM on an
$N$-node butterfly or hypercube in $O(\log N)$ steps using
constant size queues on each edge.

Concurrent with the development of
these hypercube-related packet routing algorithms has
been the development of algorithms for routing in
arrays. Kunde [8] showed how to route any permutation
of $N$ packets deterministically in $(2 + \epsilon)\sqrt{N}$ steps
using queues of size $O(1/\epsilon)$. Also, Krizanc, Rajasekaran,
and Tsantilas [7] showed how to randomly route any
permutation in $2\sqrt{N} + O(\log N)$ steps using constant
size queues. Most recently, Leighton, Makedon and
Tollis discovered a deterministic algorithm for routing
any permutation in $2\sqrt{N} - 2$ steps using constant size
queues, thus achieving the optimal time bound in the
worst case.


1.3  Our  approach 

One deficiency with the state-of-the-art in packet routing
is that aside from Valiant's paradigm of "first routing
to a random destination," all of the algorithms and
their analyses are very specifically tied to the network
on which the routing is to take place, as well as to the
requirement that packets are first routed to destinations
that are (in some sense) random. For example,
the butterfly routing algorithms are all quite different
from the array algorithms in the way that queue size
is kept constant. Moreover, the butterfly and hypercube
algorithms are so specific to those networks that
no $O(\log N)$-step constant-queue-size algorithm was
known for the closely related shuffle-exchange graph.
The lack of a good routing algorithm for the
shuffle-exchange graph is one of the reasons that the butterfly
is preferred to the shuffle-exchange graph in practice.

In this paper, we take a significant step towards the
development of a universal approach to packet routing.
Our approach to the problem differs from previous
approaches in that we separate the process of selecting
packet paths from the process of timing packet
movements along the paths. More precisely, given any
underlying network, and any selection of paths for the
packets, we study the problem of timing the movement
of the packets so as to minimize the total time and
maximum queue size needed to route all the packets to
their correct destinations.

Of course, there must be some correlation between
the performance of the algorithm and the selection of
the paths. In particular, the maximum distance $d$
traveled by any packet is always a lower bound on the time
required to route all packets, as is the congestion $c$ of
the paths. (The congestion of a collection of packet
paths is the largest number of packets that must
traverse a single edge during the entire course of the
routing.)

Viewed  in  terms  of  these  parameters,  then,  a  routing 
problem  can  be  broken  into  two  stages.  In  Stage  1,  we 
select  paths  for  the  packets  so  as  to  minimize  c  and  d. 
In  Stage  2,  we  schedule  the  movement  of  the  packets 
so  as  to  minimize  the  total  time  and  maximum  queue 
size. 
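To make the two parameters concrete, the following minimal sketch computes $c$ and $d$ for a collection of paths. The representation of a path as a list of node names, the function name `congestion_and_distance`, and the treatment of edges as directed pairs are illustrative assumptions and are not part of the model described in this paper.

```python
from collections import Counter


def congestion_and_distance(paths):
    """Return (c, d) for a list of paths, each given as a list of nodes.

    c is the largest number of paths that use any single directed edge;
    d is the largest number of edges on any single path.
    """
    edge_use = Counter()
    for path in paths:
        for u, v in zip(path, path[1:]):
            edge_use[(u, v)] += 1
    c = max(edge_use.values(), default=0)
    d = max((len(path) - 1 for path in paths), default=0)
    return c, d


# Three hypothetical packet paths that all cross the edge (b, c).
paths = [["a", "b", "c", "d"], ["e", "b", "c"], ["f", "b", "c", "g"]]
c, d = congestion_and_distance(paths)
print(c, d)
```

Both values are lower bounds on the length of any schedule for these paths, so Stage 2 can do no better than $\max(c, d)$ steps.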

For many networks, Stage 1 is easy. We simply use
Valiant's paradigm of first routing to a random
destination, and then routing to the correct destination. It
is easily shown for arrays, butterflies, shuffle-exchange
graphs, etc., that this approach yields values of $c$ and $d$
that are within a small constant factor of the diameter
of the network, which is as well as can be done.
Moreover, this technique also usually works for many-one
problems provided that the address space is randomly
hashed.
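As an illustration of Stage 1, the sketch below selects a path by first routing to a random intermediate node and then to the true destination. The choice of a hypercube, the bit-fixing rule used inside each phase, and the names `bit_fixing_path` and `valiant_path` are assumptions made for this example only; the same two-phase idea applies to the other networks mentioned above with their own greedy paths.

```python
import random


def bit_fixing_path(src, dst, dim):
    """Greedy hypercube path: correct differing address bits from low to high."""
    path = [src]
    cur = src
    for bit in range(dim):
        if ((cur ^ dst) >> bit) & 1:
            cur ^= 1 << bit
            path.append(cur)
    return path


def valiant_path(src, dst, dim, rng):
    """Route src -> random intermediate node -> dst on a dim-dimensional hypercube."""
    mid = rng.randrange(1 << dim)
    first = bit_fixing_path(src, mid, dim)
    second = bit_fixing_path(mid, dst, dim)
    return first + second[1:]  # drop the repeated intermediate node


rng = random.Random(0)
print(valiant_path(0b0000, 0b1111, 4, rng))
```

Each phase uses at most `dim` edges, so the distance of a path chosen this way is at most twice the diameter of the hypercube.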

Stage 2 has traditionally been the hard part of routing.
Curiously, however, we have found that by ignoring
the underlying network and the method of path
selection, Stage 2 actually becomes easier to solve! Hence
we will be able to obtain results for routing that are
both simpler and far more general than existing
approaches. Among other things, we will be able to
route on the $N$-node mesh in $O(\sqrt{N})$ steps using
constant size queues with the same algorithm that uses
$O(\log N)$ steps and constant size queues on the
butterfly. We will also be able to route on the
shuffle-exchange graph in $O(\log N)$ steps with constant size
queues. Also, we provide the first examples of volume
and area-universal networks that require only $O(\log N)$
slowdown by showing how to route efficiently on a
fat-tree.


1.4  Outline of the results

Our most difficult result is a proof that any set of
packet paths with congestion $c$ and distance $d$ can be
scheduled so as to complete the routing in $O(c + d)$
steps using constant size queues. This result is optimal
up to constant factors, and substantially improves the
naive bound of $O(cd)$ steps and $O(c)$ size queues.
Unfortunately, the result is highly nonconstructive, and
therefore is useful only if substantial amounts of
off-line computation are available for the routing. On the
other hand, the result is robust in the sense that it
provides a near-optimal schedule of packet movements for
any set of paths and any underlying network. Such
robustness is particularly useful when dealing with routing
problems on arbitrary distributed networks as in
[12]. The proof of the result is contained in Section 2.

We do not know whether or not there is an on-line
algorithm that can route any set of paths in $O(c + d)$
steps with constant size queues. It is not difficult to
devise a randomized on-line algorithm to schedule any
set of $N$ paths in $O(c + d \log N)$ steps using queues
of size $O(\log N)$. In special cases, however, we can
do better. For example, a slight variant of Ranade's
algorithm can be used to schedule on-line any leveled
set of $N$ paths on a bounded-degree network in
$O(c + d + \log N)$ steps using constant size queues. By a leveled
set of paths, we mean a set of paths for which each
packet starts from a level one node, progresses from a
level $i$ node to a level $i + 1$ node at each step, and ends
at a level $d$ node. For example, greedy paths on the
butterfly are leveled in this fashion. The algorithm is
randomized, but requires only $\Theta(\log N \log\log N)$ bits
of randomness to succeed with high probability. The
proof of this result is included in Section 3. Curiously,
the proof is simpler than the previous proof of the same
result applied specifically to routing random paths in
butterflies [18]. (The fact that Ranade's algorithm can
be used in this general context has also been observed
by Ranade [19].)
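The definition of a leveled set of paths can be checked mechanically. The sketch below is only an illustration of that definition: the dictionary `level` mapping each node to its level, the function name `is_leveled`, and the encoding of nodes as `(level, row)` pairs are assumptions introduced here.

```python
def is_leveled(paths, level, d):
    """Check that every path starts at level 1, ends at level d,
    and moves from a level i node to a level i + 1 node at each step."""
    for path in paths:
        if level[path[0]] != 1 or level[path[-1]] != d:
            return False
        for u, v in zip(path, path[1:]):
            if level[v] != level[u] + 1:
                return False
    return True


# Hypothetical 3-level network whose nodes are (level, row) pairs.
level = {(l, r): l for l in (1, 2, 3) for r in (0, 1)}
good = [[(1, 0), (2, 1), (3, 1)], [(1, 1), (2, 1), (3, 0)]]
bad = [[(1, 0), (3, 1)]]  # skips level 2
print(is_leveled(good, level, 3), is_leveled(bad, level, 3))
```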

The on-line algorithm for leveled networks can
immediately be applied to obtain good routing
algorithms for arrays and butterflies. With some extra
effort, it can also be applied to obtain the first
constant queue size algorithm for routing on the
shuffle-exchange graph. It can also be applied to construct a
class of networks that are area universal in the sense
that the network in the class with $N$ processors has
area $O(N)$, and can, with high probability, simulate
in $O(\log N)$ steps each step of any other network of
area $O(N)$. An analogous result is shown for a class of
volume universal networks. The details of these
applications are included in Section 4.

This paper leaves open the question of whether or
not there is an on-line algorithm that can schedule any
set of paths in $O(c + d)$ steps using constant size queues.
We suspect that finding such an algorithm (if one
exists) will be a challenging task. Our negative
suspicions are derived from the fact that we can construct
counterexamples to most of the simplest on-line
algorithms. In other words, for several natural on-line
algorithms (including the algorithm described in Section
3) we can find packet paths for which the algorithm
will construct a schedule using substantially more than
$\Omega(c + d + \log N)$ steps. Several of the counterexamples
are included in Section 5.


2  An O(c + d) off-line algorithm


In this section we show that for any set of paths with
maximum congestion $c$ and maximum distance $d$, in
any network, there is a schedule of length $O(c + d)$ in
which at most one packet traverses each edge of the
network at each step, and at most $O(1)$ packets wait
in each queue at each step. In our routing network
model, all packets are stored in queues at the ends of
edges. At each time step a packet either waits in a
queue or traverses an edge and enters the queue at the
end of that edge. We assume that at the beginning and
end of the routing there is one packet in each queue.

A  schedule  for  a  set  of  packets  simply  specifies  at  each 
time  step  which  packets  move  and  which  wait. 

Our strategy for constructing an efficient schedule
is to make a succession of refinements to the "greedy"
schedule, $S_1$, in which each packet moves at every step
until it reaches its final destination. The length of $S_1$
is only $d$, but it does not meet the requirement that at
most one packet traverses each edge of the network at
each step, since as many as $c$ packets may use an edge
in a single step. Each refinement brings us closer to
meeting this requirement by bounding the congestion
within smaller and smaller frames of time. A $T$-frame
is a sequence of $T$ consecutive time steps. The frame
congestion, $C$, in a $T$-frame is the largest number of
packets that traverse any edge in the frame. The
relative congestion, $R$, in a $T$-frame is the ratio $C/T$ of
the congestion in the frame to the size of the frame.
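The definitions of frame congestion and relative congestion can be illustrated with a short sketch. Here a schedule is assumed to be given as a flat list of `(time_step, edge)` traversals, one entry per packet movement; this encoding and the name `relative_congestion` are hypothetical choices for the example.

```python
from collections import defaultdict


def relative_congestion(traversals, T):
    """Largest ratio C/T over all T-frames, where C is the number of
    traversals of a single edge that fall inside the frame."""
    times_by_edge = defaultdict(list)
    for t, edge in traversals:
        times_by_edge[edge].append(t)
    worst = 0
    for times in times_by_edge.values():
        times.sort()
        lo = 0
        # Slide a window covering the T steps that end at each traversal time.
        for hi, t in enumerate(times):
            while times[lo] <= t - T:
                lo += 1
            worst = max(worst, hi - lo + 1)
    return worst / T


# Greedy schedule for three hypothetical packets that all cross edge "bc" at step 1.
greedy = [(0, "ab"), (1, "bc"), (0, "eb"), (1, "bc"), (0, "fb"), (1, "bc")]
print(relative_congestion(greedy, 1), relative_congestion(greedy, 3))
```

In this small example the greedy schedule has relative congestion 3 in frames of size 1 but only 1 in frames of size 3, which is the kind of gap the refinements are meant to close.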

A refinement transforms a schedule $S_i$ with relative
congestion at most $r^{(i)}$ in any frame of size $I^{(i)}$ or
greater into a schedule $S_{i+1}$ with relative congestion
at most $r^{(i+1)}$ in any frame of size $I^{(i+1)}$ or greater,
where $r^{(i+1)} \approx r^{(i)}$ and $I^{(i+1)} \ll I^{(i)}$. (For ease of
notation, we use $I$ and $r$ in place of $I^{(i)}$ and $r^{(i)}$.) We shall
assume without loss of generality that $c = d$. Thus, at
the start, the relative congestion in a $d$-frame of $S_1$ is
at most 1. After a series of $j = O(\log^* d)$ refinements,
we obtain a schedule $S_j$ with relative congestion $O(1)$
in every frame of size $k_0$ or greater, where $k_0$ is some
constant. From $S_j$ it is straightforward to construct a
schedule of length $O(c + d)$ in which at most one packet
traverses each edge of the network at each step, and at
most $O(1)$ packets wait in each queue at each step.

In the $i$th refinement, schedule $S_i$ is broken into
blocks of $2I^3 + 2I^2 - I$ consecutive time steps. Each
block is rescheduled independently. For each block,
each packet is assigned a random delay chosen
independently and uniformly from 1 to $I$. A packet assigned
a delay of $x$ must wait for $x$ steps at the beginning
of the block. In order to bound the queue size and
length of our final schedule, it is crucial that we
maintain the invariant that in schedule $S_{i+1}$ every packet
waits at most once every $I^{(i+1)}$ steps. Thus, instead of
delaying the packet for $x$ consecutive steps at the
beginning of the block, we insert one delay every $I$ steps
in the first $xI$ steps of the block.¹ A packet that is
delayed $x$ steps reaches its destination at the end of
the block by step $2I^3 + 2I^2 - I + x$. Since some packet
may have delay $x = I$, the rescheduled block must have
length $2I^3 + 2I^2$. In order to independently reschedule
the next block, the packets must reside in exactly
the same queues at the end of the rescheduled block
that they did at the end of the block of $S_i$. Since some
packets arrive early, they must be slowed down. Thus,
a delay is inserted every $I$ steps in the last $I(I - x)$
steps of the block. Note that at the beginning of the
first block and end of the last block, it is not necessary
to separate the delays by $I$ steps.
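The placement of the delays within a rescheduled block can be sketched as follows. The exact step indices are an assumption of this example (steps are numbered from 1, the early delays fall at steps $I, 2I, \ldots, xI$, and the late delays are spaced $I$ apart up to the last step of the block); the text above fixes only the spacing and the regions. The function name `delay_steps` is likewise hypothetical, and the check against earlier delays described in the footnote is not modeled.

```python
def delay_steps(I, x):
    """Steps (1-based, within the rescheduled block) at which a packet
    with random delay x in 1..I is made to wait."""
    if not 1 <= x <= I:
        raise ValueError("delay x must lie in 1..I")
    block_len = 2 * I**3 + 2 * I**2  # length of the rescheduled block
    # One delay every I steps within the first x*I steps.
    head = [k * I for k in range(1, x + 1)]
    # One delay every I steps within the last I*(I - x) steps.
    tail_start = block_len - I * (I - x)
    tail = [tail_start + k * I for k in range(1, I - x + 1)]
    return head + tail


I = 4
for x in range(1, I + 1):
    steps = delay_steps(I, x)
    print(x, len(steps), steps)
```

Under these assumptions every packet waits exactly $I$ times per block regardless of $x$, which is why all packets finish the rescheduled block in the same queues as in $S_i$ and why the block grows from $2I^3 + 2I^2 - I$ to $2I^3 + 2I^2$ steps.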

After adding delays to $S_i$, the congestion may have
increased in the $I^2$ steps at the beginning and end of
each block. The following lemma shows that by
increasing the frame size from $I$ to $I^2$ we can bound the
relative congestion in these regions.

Lemma 1 The relative congestion in any frame of size
$I^2$ or greater is at most $r(1 + 1/I)$.

Proof: After the delays are inserted, a packet can use
an edge in a $T$-frame if it used the edge in the frame
or in any of the $I$ steps before the frame in $S_i$. Thus,
at most $r(T + I)$ packets can use an edge in the
$T$-frame. For $T \ge I^2$, the relative congestion is at most
$r(T + I)/T \le r(1 + 1/I)$. ∎

We now show that there is some way of choosing the
delays so that in between the first and last $I^2$ steps
we can decrease the frame size substantially without
increasing the congestion much. The proof makes use
of two lemmas. The first is used to bound the relative
congestion over a wide range of frame sizes. The second
is the quintessential tool of the probabilistic method:
the Lovász local lemma [22, pp. 57-58].

Lemma 2 In any schedule, if the relative congestion
in every frame of size $T$ to $2T - 1$ is at most $R$ then the
relative congestion in any frame of size $T$ or greater is
at most $R$.

Proof: Consider a frame of size $T'$, where $T' > 2T - 1$.
The first $(\lfloor T'/T \rfloor - 1)T$ steps of the frame can be
broken into $T$-frames, each with relative congestion
at most $R$. The remainder of the $T'$-frame consists of a
single frame of size between $T$ and $2T - 1$ steps in which
the relative congestion is also at most $R$. ∎

¹ Before the delays for schedule $S_{i+1}$ have been inserted, a
packet is delayed at most once in each block of $S_i$. Prior to
inserting each new delay into a block, we check if it is within
$I^{(i+1)}$ steps of the single old delay. If the new delay would be too
close to the old delay, then it is simply not inserted. The loss of
a single delay in a block has a negligible effect on the probability
calculations in the lemmas that follow.
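The decomposition used in the proof of Lemma 2 can be written out directly. The sketch below splits a frame of size $T'$ into pieces whose sizes all lie between $T$ and $2T - 1$; the function name `split_frame` is a hypothetical label for this illustration.

```python
def split_frame(T_prime, T):
    """Split a frame of T_prime >= T steps into (floor(T_prime / T) - 1)
    frames of size T followed by one remainder of size between T and 2T - 1."""
    if T_prime < T:
        raise ValueError("frame must have size at least T")
    full = T_prime // T - 1
    pieces = [T] * full
    pieces.append(T_prime - full * T)  # equals T + (T_prime mod T)
    return pieces


print(split_frame(23, 5))
```

Since each piece carries at most $R$ times its own length in packets on any edge, the whole frame carries at most $R T'$, which is the claim of the lemma.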

Lemma 3 (Lovász) Let $A_1, \ldots, A_m$ be a set of "bad"
events each occurring with probability $p$. Suppose that
every bad event depends on at most $b$ other bad events
(i.e., every bad event is mutually independent of some
set of $m - b$ other bad events). If $4pb < 1$, then the
probability that no bad event occurs is nonzero. ∎

With these lemmas in hand, we can proceed with the
proof.

Lemma 4 There is some way of choosing the packet
delays so that in between the first and last $I^2$ steps of
a block, the relative congestion in any frame of size
$I_1 = \log^2 I$ or greater is at most $r_1 = r(1 + \epsilon_1)$, where
$\epsilon_1 = O(1)/\sqrt{\log I}$.

Proof: With each edge we associate a bad event. For
edge $e$, a bad event occurs when more than $r_1 T$ packets
use $e$ in any $T$-frame for $T$ in the range $I_1$ to $2I_1 - 1$. To
show that no bad event occurs, we need to bound both
the dependence of the bad events and the probability
that an individual bad event occurs.

We first bound the dependence. At most
$r(2I^3 + 2I^2 - I)$ packets use an edge in the block.² Each of
these packets travels through at most $2I^3 + 2I^2 - I$
other edges in the block. As we shall see later, it will
always be true that $r = r^{(i)} = O(1)$. Thus a bad event
depends on $b = O(I^6)$ other bad events.

Now let us compute an upper bound on the probability,
$p_1$, that more than $r_1 I_1$ packets use an edge $e$ in
a particular $I_1$-frame. Since a packet may be delayed
up to $I$ steps before the frame, any packet that uses $e$
in the frame or in any of the $I$ steps before the frame
in $S_i$ may use $e$ after the delays are inserted into $S_i$.
Thus, there are at most $r(I + I_1)$ packets that can use
$e$ in the frame. For each of these the probability that
the packet uses $e$ in the frame after being delayed is
at most $I_1/I$. If we assume that no packet uses an
edge more than once, then these probabilities are
independent. Thus, the probability $p_1$ that more than $r_1 I_1$
packets use $e$ in the frame is at most

$$\sum_{k = r_1 I_1}^{r(I + I_1)} \binom{r(I + I_1)}{k} \left(\frac{I_1}{I}\right)^{k} \left(1 - \frac{I_1}{I}\right)^{r(I + I_1) - k}.$$

Let $r_1 = r(1 + \epsilon_1)$. Using the inequalities $1 + x \le e^x$,
$\ln(1 + x) \ge x - x^2/2$ for $0 \le x \le 1$, and $\binom{a}{b} \le (ae/b)^b$
for $0 < b < a$, we have

$$p_1 \le O\left(e^{-r I_1 \left(\epsilon_1^2/2 - \epsilon_1^3/2 - I_1/I - \epsilon_1 I_1/I\right)}\right).$$

For $I_1 = \log^2 I$ and $\epsilon_1 = k_1/\sqrt{\log I}$, we can ensure that
$p_1 \le 1/I^{k_2}$ for any constant $k_2 > 0$ by making constant
$k_1$ large enough.

² Throughout the following lemmas we make references to
quantities such as $rI$ packets or $\log^2 I$ time steps, when in fact
$rI$ and $\log^2 I$ may not be integral. Rounding these quantities to
integer values when necessary does not affect the correctness of
the proof. For ease of exposition, we shall henceforth cease to
consider the issue.

Next we need to bound the probability $p_2$ that
more than $r_1 I_1$ packets use $e$ in any $I_1$-frame of the
block. There are at most $O(I^3)$ $I_1$-frames. Thus
$p_2 \le O(I^3) p_1$. By making the constant $k_2$ large
enough, we can ensure that $p_2 \le 1/I^{k_3}$, for any
constant $k_3 > 0$.

The calculations for frames of size $I_1 + 1$ through
$2I_1 - 1$ are similar. There are at most $O(I^3)$ frames
of any one size, and $2I_1$ frame sizes between $I_1$ and
$2I_1 - 1$. By adjusting the constants as before, we can
guarantee that the probability $p$ that more than $r_1 T$
packets use $e$ in any $T$-frame for $T$ between $I_1$ and
$2I_1 - 1$ is at most $1/I^{k_4}$ for any constant $k_4 > 0$.

Finally, since a bad event depends on only $b = O(I^6)$
other bad events, we can make $4pb < 1$ by making $k_4$
large enough. By the Lovász local lemma, there is some
way of choosing the packet delays so that no bad event
occurs. ∎
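As a quick sanity check of this last step, suppose (hypothetically) that the dependence satisfies $b \le C I^6$ for some constant $C$ and that the constants are adjusted so that $p \le 1/I^{7}$. The constant $C$ and the specific exponent 7 are assumptions chosen only to make the arithmetic explicit. Then the condition of the local lemma reads:

```latex
4pb \;\le\; 4 \cdot \frac{1}{I^{7}} \cdot C I^{6} \;=\; \frac{4C}{I} \;<\; 1
\qquad \text{whenever } I > 4C .
```

So the condition holds for every sufficiently large frame size $I$, which is consistent with the remark later in this section that the recursion stops once $I$ is bounded by a constant.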

Although the frame size in the center of each block
has decreased, it has increased from $I$ to $I^2$ in the first
and last $I^2$ steps of the block. To decrease the frame
size in these regions, we move the block boundaries
to the centers of the blocks. Now each block of size
$2I^3 + 2I^2$ has a "fuzzy" region of size $2I^2$ in its center
in which the relative congestion in any frame of size
$I^2$ or greater is $r(1 + 1/I)$. In the $I^3$ steps before and
after the fuzzy region, the relative congestion in any
frame of size $I_1$ or greater is $r_1$. To reduce the frame
size in the fuzzy region, we assign a random delay from
1 to $I^2$ to each packet. A packet with delay $x$ waits
once every $I^3/x$ steps in the $I^3$ steps before the fuzzy
region and once every $I^3/(I^2 - x)$ steps in the $I^3$ steps
after the region. The rescheduled block now has size
$2I^3 + 3I^2$.

We now show that there is some way of inserting
delays into the schedule before the fuzzy region that
both reduces the frame size in the fuzzy region, and
does not increase either the frame size or the relative
congestion before the fuzzy region by much. A similar
analysis holds after the fuzzy region.

Lemma 5 There is some way of choosing the packet
delays so that between steps $I \log^3 I$ and $I^3$, the
relative congestion in any frame of size $I_1$ or greater is
at most $r_2 = r_1(1 + \epsilon_2)$, where $\epsilon_2 = O(1)/\sqrt{\log I}$, and so
that in the fuzzy region the relative congestion in any
frame of size $I_1$ or greater is at most $r_3 = r(1 + \epsilon_3)$,
where $\epsilon_3 = O(1)/\sqrt{\log I}$.

Proof: Since no delays are inserted into the fuzzy
region, the proof that the frame size has been reduced
in the fuzzy region is analogous to the proof of the
previous lemma.

Before the fuzzy region, the situation is more complex.
By the $k$th step, $0 \le k \le I^3$, a packet with delay
$x$ has waited $xk/I^3$ times. Thus, the delay of a packet
at the $k$th step varies essentially uniformly from 0 to
$u = k/I$. For $u \ge \log^3 I$, or equivalently, $k \ge I \log^3 I$,
we can show that the relative congestion in any frame
of size $I_1$ or greater has not increased much.

The proof uses the Lovász local lemma as before.
The calculation for the dependence is unchanged. The
probability $p_3$ that more than $r_2 I_1$ packets use an edge
$e$ in a particular $I_1$-frame is at most

$$\sum_{k = r_2 I_1}^{r_1(I_1 + u)} \binom{r_1(I_1 + u)}{k} \left(\frac{I_1}{u}\right)^{k} \left(1 - \frac{I_1}{u}\right)^{r_1(I_1 + u) - k}.$$

Using the same inequalities as before, we have

$$p_3 \le O\left(e^{-r_1 I_1 \left(\epsilon_2^2/2 - \epsilon_2^3/2 - I_1/u - \epsilon_2 I_1/u\right)}\right).$$

For $I_1 = \log^2 I$, $u \ge \log^3 I$, it suffices that
$\epsilon_2 = O(1)/\sqrt{\log I}$. ∎

For steps 0 to $I \log^3 I$, we use the following lemma
to bound the frame size and relative congestion.

Lemma 6 The relative congestion in any frame of size
$I_2$ or greater between steps 0 and $I \log^3 I$ is at most $r_4$,
where $I_2 = \log^4 I$ and $r_4 = r_1(1 + 1/\log I)$.

Proof: The proof is similar to that of Lemma 1. ∎

We have now completed our transformation of schedule
$S_i$ into schedule $S_{i+1}$. Let us review the relative
congestion and frame sizes in the different parts of a
block of $S_{i+1}$. Between steps 0 and $I \log^3 I$, the
relative congestion in any frame of size $I_2$ or greater is at
most $r_4$. Between this region and the fuzzy region, the
relative congestion in any frame of size $I_1$ or greater
is at most $r_2$. In the fuzzy region, the relative
congestion in any frame of size $I_1$ or greater is at most
$r_3$. After the fuzzy region, the relative congestion in
any frame of size $I_1$ or greater is again $r_2$, until step
$2I^3 + 3I^2 - I \log^3 I$, where the relative congestion in
any frame of size $I_2$ or greater is $r_4$. For the entire
block it is safe to say that the relative congestion in
any frame of size $I^{(i+1)} = \log^4 I$ or greater is at most
$r^{(i+1)} = r(1 + O(1)/\sqrt{\log I})$.

The following theorem shows that by repeatedly
applying this refinement step, we can construct an
asymptotically optimal schedule.

Theorem 7 For any set of paths with maximum
congestion $c$ and maximum distance $d$, there is a schedule
of length $O(c + d)$ in which at most one packet traverses
each edge of the network at each step, and at most $O(1)$
packets wait in each queue at each step.

Proof: Without loss of generality, assume c = d.

We begin by assigning each packet a random delay
chosen uniformly from 0 to d at the beginning of the
greedy schedule S_0. Using the Lovász local lemma,
we can show that there is some way of choosing the
delays so that in the resulting schedule S_1, the relative
congestion is at most r^{(1)} = O(1) in any frame of size
I^{(1)} = log d or greater.
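The effect of the initial random delays can be explored numerically. The sketch below is only an illustration and rests on assumptions not taken from the text: a path is a list of hashable edge names, a packet with delay t crosses its i-th edge at step t + i, and the relative congestion of a frame is the number of crossings of one edge inside the frame divided by the frame size. The function names are hypothetical.

```python
import random
from collections import defaultdict

def random_delays(num_packets, d, rng):
    """One delay per packet, chosen uniformly from 0..d (inclusive)."""
    return [rng.randint(0, d) for _ in range(num_packets)]

def relative_congestion(paths, delays, frame):
    """Largest (crossings of one edge within `frame` consecutive steps) / frame."""
    crossings = defaultdict(list)          # edge -> steps at which it is crossed
    for path, delay in zip(paths, delays):
        for hop, edge in enumerate(path):
            crossings[edge].append(delay + hop)
    worst = 0.0
    for steps in crossings.values():
        steps.sort()
        lo = 0
        for hi, t in enumerate(steps):
            # Keep only crossings inside the window (t - frame, t].
            while steps[lo] <= t - frame:
                lo += 1
            worst = max(worst, (hi - lo + 1) / frame)
    return worst
```

Calling `relative_congestion(paths, random_delays(len(paths), d, random.Random(0)), frame)` for several frame sizes gives a rough feel for how spreading the start times lowers the congestion seen in short frames.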

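Next, we repeatedly use the refining algorithm to
reduce the frame size. The relative congestion r^{(i+1)}
and frame size I^{(i+1)} of schedule S_{i+1} are given by the
recurrences

\[ r^{(i+1)} = \begin{cases} O(1) & i = 0 \\ r^{(i)} \left(1 + O(1)/\sqrt{\log I^{(i)}}\right) & i \ge 1 \end{cases} \]

and

\[ I^{(i+1)} = \begin{cases} \log d & i = 0 \\ \log^2 I^{(i)} & i \ge 1 \end{cases} \]

which have solutions r^{(j)} = O(1) and I^{(j)} = O(1),
where j = O(log* d).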
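A small numerical experiment can make it plausible that such recurrences bottom out after very few rounds while the congestion grows only by a constant factor. The sketch below is illustrative only: it assumes that one refinement shrinks the frame size from I to roughly log^2 I and multiplies the relative congestion by 1 + const/sqrt(log I). The constant, the stopping threshold, and the function name are all assumptions rather than values from the paper.

```python
import math

def iterate_refinement(d, const=1.0, floor=64.0):
    """Iterate I -> (log2 I)^2 from I = log2(d) until I <= floor.

    Returns (rounds, final_frame, accumulated_congestion_factor).
    """
    frame = math.log2(d)      # frame size after the initial random delays
    factor = 1.0              # accumulated growth of the relative congestion
    rounds = 1
    while frame > floor:
        factor *= 1.0 + const / math.sqrt(math.log2(frame))
        frame = math.log2(frame) ** 2
        rounds += 1
    return rounds, frame, factor
```

Even for an astronomically large d such as 2**1024 the loop stops after a handful of rounds, which is the behaviour one expects from an iterated-logarithm number of refinements.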

We have not explicitly defined the values of r and I
for which the recursion terminates. However, in many
places we implicitly use the fact that I is sufficiently
large or r is sufficiently small that certain inequalities
hold. The recursion terminates when the first of these
inequalities fails to hold. When this happens, one of r
or I is O(1), which implies that the other is also.

Since a packet waits at most once every I^{(i)} steps in
S_i, it waits at most once every Ω(1) steps in S_j, which
implies both that the queues in S_j cannot grow larger
than O(1) and that the total length of S_j is O(d).

Schedule S_j almost satisfies the requirement that at
most one packet traverse each edge in each step. By
simulating each step of S_j in O(1) steps we can meet
this requirement with only a factor of 2 increase in the
queue size and a factor of O(1) increase in the running
time. ∎

Why is this proof so complicated? Using the same
basic ideas, it is possible to construct in a much simpler
fashion a schedule of length d 2^{O(log* d)} that uses queues
of size O(log d). Unfortunately, removing the 2^{O(log* d)}
factor seems to require delving into second order terms
in the probability calculations, and reducing the queue
size to O(1) mandates great care in spreading delays
out over the schedule.


3 On-line algorithms


3.1 An O(c + d log n) on-line algorithm


By applying the type of probabilistic analysis used
in Section 2, it is fairly straightforward to schedule
any set of n paths in O(c + d log n) steps with queues
of size O(log n). We simply delay the start of each
packet by a random amount that is chosen uniformly
from [1, c/log n] and then route all the packets forward
in a synchronized fashion. More precisely, we introduce
the initial delays and then consider the unconstrained
schedule without regard for the rule that at
most one packet traverse any edge in a single step.
With high probability, no more than O(log n) packets
will want to traverse any edge at any step of the
unconstrained schedule. Hence we can simulate each
step of the unconstrained schedule with O(log n) steps
of a legitimate schedule. The final schedule consumes
O((d + c/log n) log n) = O(c + d log n) steps to complete
the routing and uses O(log n) size queues.
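The scheme above is simple enough to simulate. The following sketch is a hedged illustration, not the authors' implementation: it assumes delays are drawn from 1..floor(c/log2 n), represents a path as a list of hashable edge names, and reports the length of the unconstrained schedule together with the largest number of packets that want the same edge in the same step. Multiplying the two gives an upper bound on the length of a legitimate schedule obtained by serializing each step. The function name and parameters are hypothetical.

```python
import math
import random
from collections import Counter

def unconstrained_schedule(paths, c, n, rng):
    """Return (steps, peak) for the randomly delayed, unconstrained schedule."""
    span = max(1, c // max(1, int(math.log2(n))))   # assumed delay range c / log n
    delays = [rng.randint(1, span) for _ in paths]
    load = Counter()                                # (edge, step) -> packet count
    for path, delay in zip(paths, delays):
        for hop, edge in enumerate(path):
            load[(edge, delay + hop)] += 1
    steps = max(delay + len(path) for path, delay in zip(paths, delays))
    peak = max(load.values())
    return steps, peak
```

For example, with 64 packets that all follow the same chain of 8 edges, `unconstrained_schedule(paths, c=64, n=64, rng=random.Random(0))` returns a schedule of at most 18 steps, and `steps * peak` bounds the serialized running time.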

3.2 An O(c + d + log n) on-line algorithm for leveled networks

Consider any set of n leveled paths spanning d levels
with congestion c. For simplicity, we will think of the
packets as being distinct (i.e., no combining will be
allowed) although our analysis can easily be extended
to the case where arbitrary combining takes place. We
will allow up to c packets to originate at the same node
and to end at the same node. For this purpose, we will
allow queues of size c at the first and last levels, but will
restrict queues in the interior levels to have constant
size q. The value of q can be any integer (including 1),
and will affect the overall routing time by a constant
factor. We will also assume that the underlying network
has indegree and outdegree 2, although the result
can easily be extended to networks with any constant
degree. In what follows, we show how to route all the
packets in O(d + c + log n) steps with high probability
without overflowing any queue.

The algorithm for scheduling the packets is identical
to Ranade's algorithm except that we select random
keys by which the packets are ordered instead of
ordering based on destination address, as in [21]. In
particular, each packet is assigned a random key and
a packet is routed through a node only after all the
other packets with lower keys that are destined to be
routed through the same node have done so. Queues
are placed at the end of each edge and a packet advances
forward only if there is already room for the
packet in its next queue. For simplicity we will assume
that the queue size is at least two, so that once
a queue contains a packet, it does not become empty
until it transmits an end-of-stream signal. With minor
modifications, the analysis can be made to work
with queues of size one. To keep things moving, ghost
messages are sent along each edge that is not transmitting
a packet. The ghost message provides the best
lower bound known by the node for the size of the key
of the next packet to be sent. Ghost messages allow
a processor to send a packet forward from one incoming
edge without having to wait for actual packets (if
any) on the other incoming edges (provided, of course,
that the ghost messages on these incoming edges indicate
that any such packets would have to have higher
keys). Ghost messages are saved only if they arrive at
the head of the queue.
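The forwarding rule with ghost messages can be sketched for a single node as follows. This is a simplified illustration under stated assumptions, not Ranade's algorithm in full: it looks only at the heads of the incoming queues, assumes every head message wants the same outgoing edge, and ignores queue space and end-of-stream signals. The `Msg` type and the `select` function are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence

class Msg(NamedTuple):
    key: int                      # random key of a packet, or lower bound carried by a ghost
    ghost: bool                   # True for a ghost message, False for a real packet
    payload: Optional[str] = None

def select(heads: Sequence[Optional[Msg]]) -> Optional[Msg]:
    """Decide what the node sends next, given the head of each incoming queue.

    A head of None means nothing has arrived on that edge yet, so the node
    cannot rule out a smaller key there and must wait.
    """
    if any(h is None for h in heads):
        return None
    # Smallest key wins; on a tie a real packet goes before a ghost.
    best = min(heads, key=lambda m: (m.key, m.ghost))
    if best.ghost:
        # Only a lower bound is known: pass that bound on as a ghost.
        return Msg(best.key, True)
    return best
```

In this sketch a real packet moves forward as soon as the ghosts on the other incoming edges carry keys at least as large as its own, which is the situation the ghost messages are meant to unblock.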

To prove that the algorithm completes the routing
in O(c + d + log n) steps, we use the same delay path
argument as Ranade [18] (which, in turn, is quite similar
to the ones used by Aleliunas [2] and Upfal [24]),
but we simplify the counting part of the analysis. The
simplified counting has the additional nice feature that
it allows the interior queue sizes to be as small as one,
which was not possible with Ranade's original analysis.
We provide a sketch of the argument in what follows.
The complete proof will be included in the full draft of
the paper.

If some packet is delayed by w steps during the
course of the routing, then there is a delay path of
length l that coincides with w packet paths arranged
in order of decreasing key size (working backward). If
f is the number of forward edges in the delay path,
then w ≥ qf/2 and l = d + 2f ≤ d + 4w/q. The number
of possible delay paths is at most

\[ n \, 4^{l} \binom{l+w}{w} (2c)^{w} \]

since there are n places that the path can start, 4^l ways
that it can continue, \binom{l+w}{w} ways of locating the points
of incidence with w packet paths, and (2c)^w ways to
pick packets at the points of incidence.

If  the  random  keys  are  chosen  from  [l,u;],  then  the 
probability  that  the  w  keys  for  any  delay  path  are  in  the 
right  order  is  at  most  (’*)/u)*'.  Hence,  the  probability 
that  there  is  a  delay  path  corresponding  to  w  delay  is 
at  most 


n4«('r)(2c)*>(l")  <  2^^-^^^»cc» 

'  w  ’  ' 
For w = Ω(d + c + log n), this probability can be made
arbitrarily small, even if q = 1.

Note that in the case that w = Θ(log n), the preceding
argument requires only O(log n log log n) bits of
randomness, since the keys lie in the range [1, w], and
we only require that sets of w keys be independent.*
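The key-ordering step of the delay path argument can be checked by brute force for small w. The sketch below enumerates all assignments of w keys from [1, w] and counts those in non-increasing order. The closed form binom(2w - 1, w) used in the check is the standard stars-and-bars count of multisets, stated here as a known combinatorial fact rather than as the bound printed in the paper. The function name is hypothetical.

```python
from itertools import product
from math import comb

def ordered_key_count(w):
    """Return (ordered, total): key sequences of length w over 1..w that are
    non-increasing, and the total number of key sequences."""
    total = ordered = 0
    for keys in product(range(1, w + 1), repeat=w):
        total += 1
        ordered += all(a >= b for a, b in zip(keys, keys[1:]))
    return ordered, total

# The ordered fraction shrinks quickly as w grows.
for w in range(1, 6):
    ordered, total = ordered_key_count(w)
    print(w, ordered, total, ordered / total)
```

Because the number of ordered sequences grows only exponentially in w while the total grows like w^w, the probability of any one fixed sequence of w packets having correctly ordered keys falls off super-exponentially, which is what lets it pay for the count of possible delay paths.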


4  Applications 


It  is  straightforward  to  apply  the  algorithm  described 
in  Section  3  to  route  packets  on  arrays  and  butterflies. 
For  two-dimensional  arrays,  the  paths  of  the  packets 
are  selected  greedily  with  each  packet  first  traveling  to 
the  correct  row,  and  then  to  the  correct  column.  For  ar¬ 
rays  of  higher  dimension  and  butterflies,  random  path 
selection  works  fine,  and  the  resulting  time  bounds  are 
within  a  constant  factor  of  optimal.  Some  care  must 
be  taken  to  get  around  the  queues  of  size  c  at  the  first 
and  last  levels  of  the  network,  but  this  is  not  difficult 
to  do. 

It is not so clear how to apply the algorithm from
Section 3 to route on networks such as the shuffle-exchange
or the de Bruijn graphs, however. The reason


*The use of the range [1, w] for the random keys was suggested
to us by Ranade [19]. Essentially the same range is used in [21].
The constants in the running time can be reduced by increasing
the range to (l,n^j. In this case, the number of random bits
required is O(log^2 n).


is  that  they  do  not  have  an  apparent  leveled  struc¬ 
ture.  Nevertheless,  we  can  still  obtain  a  good  routing 
algorithm  for  these  networks  by  identifying  a  leveled- 
like  structure  in  a  large  portion  of  the  shuffle-exchange 
graph.  The  details  are  included  in  Section  4.1. 

In  Section  4.2,  we  show  how  to  adapt  the  on-line 
algorithm  to  efficiently  route  on  fat-trees[13],  thus  pro¬ 
viding  the  first  examples  of  area  and  volume-universal 
networks  with  slowdown  O(log  N). 

4.1  Routing  on  the  shuffle-exchange  graph 

In this section, we will show how to apply the techniques
of Section 3 to obtain a randomized routing
algorithm on the shuffle-exchange graph. It works for
any N packet routing problem with at most one packet
starting at any node and runs in O(log N) steps using
constant size queues.

The N-node shuffle-exchange graph is defined for every
N which is a power of two. Each node of the
(N = 2^k)-node shuffle-exchange graph is associated
with a unique k-bit binary string a_{k-1}...a_0. Two nodes
w and w' are linked via a shuffle edge if w' is a left or
right cyclic shift of w. Two nodes w and w' are linked
via an exchange edge if w and w' differ in the least
significant bit, a_0.
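The definition translates directly into bit operations. The following sketch treats a node as a k-bit integer; the function names are hypothetical, and the code is only meant to make the shuffle and exchange edges concrete.

```python
def shuffle_neighbors(w, k):
    """Nodes reached from the k-bit node w by a left or a right cyclic shift."""
    mask = (1 << k) - 1
    left = ((w << 1) | (w >> (k - 1))) & mask     # top bit wraps to the bottom
    right = (w >> 1) | ((w & 1) << (k - 1))       # bottom bit wraps to the top
    return {left, right}

def exchange_neighbor(w):
    """Node that differs from w only in the least significant bit a_0."""
    return w ^ 1
```

For k = 3, node 011 has shuffle neighbors 110 and 101 and exchange neighbor 010.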

A node of the shuffle-exchange graph is good if it
has a unique longest cyclic substring of zeros of length
greater than log log N - 1, and it is not node 0. (w =
0...0 is not a good node.) Any node that is not a good
node is bad.

We  group  the  nodes  into  necklaces  consisting  of 
nodes  that  are  cyclic  shifts  of  one  another.  A  neck¬ 
lace  consists  entirely  of  good  nodes  or  entirely  of  bad 
nodes  since  the  cyclic  length  of  any  substring  of  zeros 
is  unchanged  by  a  cyclic  shift.  Each  good  node  neck¬ 
lace  consists  of  log  N  nodes  since  each  cyclic  shift  is 
different  due  to  the  unique  longest  string  of  zeros. 
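The good/bad classification and the necklace structure can be checked mechanically. The sketch below represents a node as a string of '0' and '1' characters of length k = log N and applies the definition of a good node literally; the helper names are hypothetical.

```python
import math

def rotations(bits):
    """All cyclic shifts of a bit string, i.e. the nodes of its necklace."""
    return [bits[i:] + bits[:i] for i in range(len(bits))]

def cyclic_zero_runs(bits):
    """Lengths of the maximal runs of zeros, treating the string as a cycle."""
    if "1" not in bits:
        return [len(bits)]
    start = bits.index("1")
    rotated = bits[start:] + bits[:start]         # now begins with a 1
    return [len(run) for run in rotated.split("1") if run]

def is_good(bits):
    """Not all zeros, and a unique longest cyclic zero run longer than log log N - 1."""
    if "1" not in bits:
        return False
    runs = cyclic_zero_runs(bits)
    if not runs:
        return False                              # the all-ones node has no zeros
    longest = max(runs)
    threshold = math.log2(len(bits)) - 1          # log log N - 1, since log N = len(bits)
    return runs.count(longest) == 1 and longest > threshold
```

For k = 8, the node 00010111 is good, every cyclic shift of it is good as well, and its necklace has 8 distinct nodes, whereas 00110011 is bad because its longest run of zeros is not unique.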

We route mainly by using the good node necklaces as
a leveled structure, thus we associate bad nodes with
good necklaces. We show that at most 3 log N bad
nodes are associated with any good necklace.

Consider that there are three types of bad nodes:

1. nodes having a longest string of zeros of length less
than log log N - 1,

2. nodes with more than one group of longest zeros, and

3. node w = 0...0.

A bad node of type 1 is mapped to a good node
by making the least significant bit a one and the
log log N - 1 most significant bits zeros. This associates
the bad node with the lexicographically minimum
node of a good necklace, since after the transformation,
the highest order bits are composed of the
longest string of zeros. Only bad nodes which differ
from a good necklace's lexicographically minimum
node in at most log log N bits are mapped to it, thus at
most 2^{log log N} = log N type 1 bad nodes are associated
with any good necklace.

We map a bad node of type 2 to a good necklace
by mapping a bad necklace to a good node necklace.
We take the lexicographically minimum node in a type
2 bad node necklace and extend its leading group of
zeros by exactly one zero. All the bad nodes in this
necklace are associated with the specified good necklace.

To assess the number of bad nodes associated with a
good necklace by this operation, we consider the
lexicographically minimum node in the good necklace and
notice that only bad necklaces whose minimum node
differs in the last bit of the leading block of zeros and
possibly differs in the bit after that are mapped to that
necklace. Thus, at most two bad necklaces are associated
with any good node and thus only 2 log N bad
nodes of type 2 are associated with any good necklace.

Finally, node 0 is simulated by node 1. Note that no
bad nodes of type 1 or 2 are associated with node 1's
necklace.

So in all, at most 3 log N bad nodes are associated
with any good necklace. Recalling that all good necklaces
contain log N nodes, we have at least N/4 of the
shuffle-exchange nodes being good. We use these nodes to
perform the bulk of the routing. The basic idea is that
we deterministically route packets from bad nodes to
good, then use a randomized routing algorithm to route
between good nodes, and finally deterministically route
packets from good nodes to bad. We proceed by defining
a leveled network on the good nodes that the
shuffle-exchange graph can easily simulate with constant
overhead. For any routing problem on the good nodes we
construct paths in the leveled network with congestion
and distance O(log N) with high probability. By applying
the analysis of Section 3, we can then complete
the routing in O(log N) steps with high probability using
constant sized queues. We conclude by detailing
the routing between good and bad nodes.

4.1.1 A leveled network

For each necklace of good nodes, we pick the
lexicographically minimum node to be the representative
node for the necklace. We denote each good node by
its necklace's representation plus a line under the least
significant bit, which we refer to as the current bit.
For example, node 100011 would be written as 0001_1_1,
with the underlined bit set off by underscores here.
We define the level of the good node to be the position
of the underline counting from the left (starting at 0).
For example, 0001_1_1 is in level 4. (Note that the
representative node is in level log N - 1.)
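The level of a good node can be computed by locating where its least significant bit sits inside the necklace representative. The sketch below assumes positions are counted from the left starting at 0, which is the reading under which node 100011 lands in level 4 and the representative in level log N - 1. The function names are hypothetical, and the result is only meaningful for good nodes, whose cyclic shifts are all distinct.

```python
def representative(bits):
    """Lexicographically minimum cyclic shift of a bit string."""
    return min(bits[i:] + bits[:i] for i in range(len(bits)))

def level(bits):
    """Position, within the representative, of the node's least significant bit."""
    k = len(bits)
    rep = representative(bits)
    for r in range(k):
        # bits is the representative rotated left by r, so its last
        # character is the representative's character at position r - 1.
        if rep[r:] + rep[:r] == bits:
            return (r - 1) % k
    raise ValueError("unreachable: rep is a rotation of bits")
```

With this convention a left cyclic shift of a good node raises its level by one, until the representative at level log N - 1 wraps around to level 0.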

The leveling of the nodes just described induces
a leveling of the shift edges but does not necessarily
induce a leveling of the exchange edges. An exchange
edge even between good nodes may create a
new longest group of zeros by joining two groups of
zeros and thus connect two levels which are very far
apart. To overcome this difficulty we assume the graph
contains flip edges. A flip edge links nodes w and w' if
both  is  and  is'  are  good  nodes  with  is  =  ai.i...^...ao 

and  u/  =  *<><1  >*  not  in  the  leading 

block  of  zeros  of  ts’s  representative  node. 

Note that flip edges extend a group of zeros by at
most one. Thus no flip edge can create a new leading
group of zeros, since it can only grow the shorter
non-leading group of zeros to be as big as the leading group.
But then it would lead to a bad node, i.e., a node
having two different longest groups of zeros. This is a
contradiction since flip edges only occur between good
nodes by definition. Thus flip edges are leveled.

It is easily shown that the operation of the flip edges
can be simulated by the shuffle-exchange graph with
only a constant slowdown; each flip edge is composed of
an exchange edge, a shuffle edge, and possibly another
exchange edge.

We denote by network A the network composed of
the good nodes, the shuffle edges, excluding the shuffle
edges from level log N - 1 to 0, and the flip edges.

Note that in network A, from any level 0 good node
we can reach any necklace with a longest string of zeros
having the same or greater length by correcting bits
starting from the end of the leading block of zeros. In
fact, we wish to be able to get from the level 0 node of
a good node necklace to any other good node necklace.

Thus we append a mirror image of the good nodes
with flip and shuffle edges to itself so that we can reach
necklaces with fewer zeros. The leveling is extended in
the natural manner. We call this whole thing network
AA', and note that network A can easily simulate it.

We denote by network L the network consisting of
the shuffle edges on the good nodes, again excluding
shuffle edges from level log N - 1 to level 0. Our method
of path selection consists of routing from a good node
to its level zero node, then routing to a random
intermediate necklace, then routing to the destination
necklace, and finally routing to the appropriate good
node. Thus, the leveled network we route paths in is
composed of network L, network AA', another network
AA', followed by network L. We extend the leveling
in the natural manner and note that network A can
easily simulate the whole thing.

4.1.2  Path  selection  and  congestion 

We assume that at the start of the routing there are at
most b packets starting at any good node. The value
of b depends on the number of bad nodes mapped to
any good node, which we showed is small.

For each packet we choose its path by uniformly
choosing a random necklace to route through before
going to its final destination. So the path for a packet
consists of a path through L to node 0 of its necklace,
the path through AA' to its random intermediate
necklace, the path through the second AA' to its
destination necklace, and a path through the second L
to the proper node of the necklace.
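The four phases of a packet's path can be summarized in a few lines. The sketch below is purely schematic: necklaces are identified by their representative strings, each phase is recorded as a (network, from, to) triple, and the actual edge-by-edge routing inside L and AA' is not modelled. The function name and the triple format are hypothetical.

```python
import random

def select_route(source_necklace, dest_necklace, good_necklaces, rng):
    """Pick a uniformly random intermediate necklace and list the four phases."""
    middle = rng.choice(good_necklaces)
    return [
        ("L", source_necklace, source_necklace),    # move to level 0 of own necklace
        ("AA'", source_necklace, middle),           # go to the random intermediate necklace
        ("AA'", middle, dest_necklace),             # go to the destination necklace
        ("L", dest_necklace, dest_necklace),        # move to the proper node of that necklace
    ]
```

The random intermediate stop plays the same role as the randomized intermediate destinations used for path selection elsewhere in the paper: it makes the load on each edge depend on random choices rather than on the routing problem.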


We show that this method yields paths with congestion
O(log N) with high probability. That is, we show
that the probability of any edge being used by more
than c log N packets is 0(^7frT) for some constant c.

We observe that for the paths in the copies of L, we
have congestion b log N, since at most b log N packets
start or end in any good necklace. By symmetry we
claim that the analysis of the path portions in both
copies of AA' is the same. Finally we recall that in
AA', we route packets going to necklaces with the same
or more zeros to the appropriate necklace in network
A and straight across network A'; we route the other
packets straight across in network A and use A' to
route to the proper necklace. We will show that any
destination necklace gets O(c log N) packets with high
probability, so the straight across portion of the paths
should not be a problem. To finish, we give the analysis
of the congestion due to packets in just network A, and
claim that the arguments will hold by symmetry for A'.

Consider an edge e in the first copy of network A.
In this half, packets going to necklaces with fewer zeros
are routed straight across their starting necklace.
There are at most b log N of these, so without loss of
generality we ignore them. Suppose that e traverses
levels m and m + 1. Let x be the number of zeros
in the necklace to which e goes. If m < x, then no
packet from any other necklace uses e, since we only
map to a necklace via flip edges after its longest string
of zeros. Otherwise, we consider the number of packets
from other necklaces that can use e. We know
that only packets from at most 2^l other necklaces with
l = m - log log N could have used e since at most
l bits could have changed by level m + 1. Thus the
number of packets that can use e is at most 2^l b log N
since each necklace starts with at most b log N packets.
The  probability  that  a  specific  packet  uses  e,  is  the 
numlxr  of  necklaces  that  can  be  reached  using  e,  at 
most  (i  e..  necklaces  which  match  e’s 

necklace  in  the  first  /  +  log  log  n  bits),  divided  by  the 
total  number  of  good  necklaces,  at  least  (since 

a  constant  fraction  of  the  nodes  are  good  and  there  are 
log  N  nodes  in  a  necklace),  which  is  just  Thus  the 
probability  that  more  than  c  log  n  packets  use  e  can  be 
written 

■f  m  (^)' 

The  first  factor  in  each  term  gives  the  number  of  ways 
to  choose  the  packets.  The  second  is  the  probability 
that  all  these  packet  use  e.  The  sum  is  bounded  by 
0(7^)  if  we  choose  c  >  ^.  Thus,  the  probability 
that  any  of  the  0{ff)  edges  of  this  stage  has  conges¬ 
tion  more  than  clogA^  is  then  clearly  Oifjirr)-  For 
large  enough  c,  this  gives  the  desired  high  probability 
result  O(jtf).  This  argument,  also  provides  the  proof 
that  any  random  destination  necklace  receives  c  log  n 
packets  with  high  probability,  since  we  need  only  con¬ 


sider  the  congestion  on  the  edge  from  A  to  A'. 

We are finished showing how to route packets between
the good nodes in a leveled fashion with path
congestion and distance O(log N) with high probability.
Thus, by the arguments of Section 3 we can
solve any routing problem on the good nodes in time
O(log N) using constant sized queues.

4.1.3  Packets  from  bad  nodes 

In  this  section  we  show  how  to  deterministically  route 
a  bad  node’s  packet  to  its  associated  good  node. 

Recall that we associated a bad node of type 1 with
the necklace represented by a one in the least significant
or current bit plus log log N - 1 zeros in the most
significant bits. We route these packets in the
shuffle-exchange graph by flipping the current bit to a one and
flipping log log N - 1 bits to the right to zeros. Thus we
map a bad node to a good necklace at its level log log N
node.

For  any  necklace,  we  have  a  binary  tree,  the  leaves 
of  which  are  mapped  to  the  necklace.  Each  level  of 
the  tree  corresponds  to  one  of  the  log  log  N  bits  that 
were  flipped.  Therefore,  we  can  route  packets  from  the 
binary  tree  leaves  to  the  necklace,  and  distribute  them 
along  the  necklace  deterministically.  This  is  easily  done 
in  log  N  time  with  constant  queues.  The  routing  from 
the  necklace  to  the  tree  is  equally  trivial.  But,  we  need 
to  ensure  that  traffic  from  the  separate  binary  trees 
does  not  interfere  too  much.  This  is  easy  since  any 
bad  node  is  in  at  most  two  binary  trees;  in  at  most 
one  as  a  leaf  since  any  node  is  mapped  to  exactly  one 
good  node,  and  in  at  most  one  as  an  internal  node  since 
the  number  of  zeros  between  the  current  node  and  the 
closest  one  to  the  left  determines  a  unique  level  and 
the  rest  of  the  bits  determine  a  unique  tree. 

To finish, we consider unleveled nodes of type 2.
These are nodes without a unique longest string of zeros.
Here we extend one of the groups of zeros by one
zero, making sure not to join two groups of zeros by
inserting a one if necessary, i.e., mimicking the flip
operation. For any good necklace whose representative is
0^i 1..., only the necklaces represented by 0^{i-1}10... and
0^{i-1}11... can be mapped to it. Again, at most two bad
necklaces are associated with any good necklace.

For each packet in such a bad necklace we route it
through the node connecting it to the appropriate good
necklace. We perform this movement by pipelining the
packets through the edge which connects the two necklaces.
We see that this mapping maps at most one
packet from the bad necklace to a node in the good
necklace. Since we are basically routing on linear arrays
of length at most 2 log N, 2 log N steps suffice to
route the packets appropriately. 4 log N steps are sufficient
to route the packets from two bad necklaces.

This  finishes  the  description  of  the  maps  to  and  from 
all  the  unleveled  nodes  except  for  the  node  w  =  0, 
which  is  easily  routed  to  node  1 . 


4.2 Construction of area and volume-universal networks

In this section we construct a class of point-to-point
networks that are area-universal in the sense that a network
in the class with N processors has area O(N) and
can, with high probability, simulate in O(log N) steps
each message-step of any shared-bus network of area
O(N). The simulation is optimal because a point-to-point
network may require Ω(log N) steps to simulate
one step of a shared-bus network. The networks are
based on the fat-trees of Greenberg and Leiserson [5]
and the simulation uses the message routing algorithm
from Section 3.

In a fixed-connection network, processors communicate
via wires. Each processor has a bounded number
of read and write pins. In a point-to-point network,
each wire connects one read pin with one write pin.
In each message-step, the processor with the write pin
may transmit a message of O(log N) bits to the processor
with the read pin. In a shared-bus network, a
wire may connect many read and write pins. Such a
wire is called a bus. In each message-step, any processors
wishing to send messages make them available
on their write pins. Then the messages at the write
pins of each wire are combined by some simple rule to
form a single message. Combining is assumed to require
a single message-step, regardless of the number
of messages combined or the rule used.

Leiserson was the first to display a class of
fixed-connection networks that could efficiently simulate any
other network of the same area or volume. In [13] he
showed that a fat-tree of area O(N) can simulate in
O(log^3 N) bit-steps each bit-step of any fixed connection
network of area O(N). The simulation used an
off-line routing algorithm for fat-trees. On-line routing
algorithms were later developed by Greenberg and
Leiserson [5] and Park [16]. None of these routing algorithms
are capable of combining messages to the same
destination. As a consequence, no scheme for simulating
shared-bus networks was known until now. A network
that can simulate in O(1) steps each step of any
shared-bus network of equal area was presented
in [15]. However, the connections in this network are
not fixed, but instead processors communicate via
reconfigurable busses.

A fat-tree network is shown in Figure 1. Its underlying
structure is a complete 4-ary tree. Each edge in the
4-ary tree corresponds to a pair of oppositely directed
groups of wires called channels. The channel directed
from the leaves to the root is called an up channel; the
other is called a down channel. The capacity of a channel
c, cap(c), is the number of wires in the channel. We
call the tree "fat" because the capacities of the channels
grow by a factor of 2 at every level. A fat-tree of
height m has M^2 = 2^{2m} leaves and M = 2^m vertices
at the root.

Figure 1: A fat-tree.

It will prove useful to label the switches at the top
and bottom of each channel. Let the level of a switch
be its distance from the leaves. Suppose a channel c
connects cap(c)/2 = 2^l switches at level l with cap(c) =
2^{l+1} switches at level l + 1. Give the switches at level
l labels 0 through 2^l - 1 and the switches at level l + 1
labels 0 through 2^{l+1} - 1. Then switch k at level l is
connected to switches k and k + 2^l at level l + 1. The
following lemma relates the labels of the switches on a
message's path from a leaf to the root.


Lemma 8 There is a unique shortest path from any
leaf to a switch labeled k at the root, for 0 ≤ k ≤ M - 1,
and that path passes through a switch labeled k mod 2^l
at level l, for 0 ≤ l ≤ m. ∎
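The labeling rule and the lemma can be checked against each other in a few lines. The sketch below assumes the connection rule that switch k at level l is wired to switches k and k + 2^l at level l + 1, and verifies that the labels k mod 2^l form a valid path up to root switch k. The function names are hypothetical.

```python
def parents(label, level):
    """The two switches at level + 1 that switch `label` at `level` connects to."""
    return (label, label + 2 ** level)

def path_to_root(k, m):
    """Labels at levels 0, 1, ..., m on the path ending at root switch k."""
    return [k % (2 ** l) for l in range(m + 1)]

def is_valid_path(labels):
    """True if each label is wired to the next one under the connection rule."""
    return all(labels[l + 1] in parents(labels[l], l)
               for l in range(len(labels) - 1))
```

For a fat-tree of height m = 3, `path_to_root(5, 3)` yields the labels 0, 1, 1, 5, and every step follows one of the two wires leaving the previous switch.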

For a set Q of messages to be delivered between the
leaves of the fat-tree, we define the load of Q on a
channel c, load(Q, c), to be the number of destinations
of messages in Q for which at least one message must
pass through c. Note that even if many messages with
the same destination must pass through a channel, that
destination contributes at most one to the load of the
channel. We define the load factor of Q on c, λ(Q, c),
to be the ratio of the load of Q on c to the capacity
of c, λ(Q, c) = load(Q, c)/cap(c). The load factor on
the entire network, λ(Q), is simply the maximum load
factor on any channel, λ(Q) = max_c λ(Q, c). The load
factor is a lower bound on the number of steps
required to deliver Q. We shall assume that λ ≤ M^k,
where k is some fixed constant. We shall sometimes
write λ to denote λ(Q) when the set of messages to be
delivered is clear from the context.
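The load factor is easy to compute for a concrete message set. The sketch below makes several assumptions that are a reading of the text rather than part of it: leaves are numbered so that integer division by 4 gives the parent subtree, a channel is named by (direction, level, subtree index), a message climbs only as far as the lowest common ancestor, and the channel between levels l and l + 1 has capacity 2^(l+1). All names are hypothetical.

```python
from collections import defaultdict

def channels_on_path(src, dst):
    """Channels a message must use to get from leaf src to leaf dst."""
    channels = []
    level, s, d = 0, src, dst
    while s != d:                       # climb until both lie in the same subtree
        channels.append(("up", level, s))
        channels.append(("down", level, d))
        s //= 4
        d //= 4
        level += 1
    return channels

def load_factor(messages):
    """Maximum over channels of (distinct destinations using it) / capacity."""
    dests = defaultdict(set)
    for src, dst in messages:
        for ch in channels_on_path(src, dst):
            dests[ch].add(dst)          # a destination counts once per channel
    return max((len(ds) / 2 ** (ch[1] + 1) for ch, ds in dests.items()),
               default=0.0)
```

Three messages sent to the same leaf load each channel as a single destination, so `load_factor([(0, 5), (1, 5), (2, 5)])` is 0.5, while three messages to three different leaves in the same subtree raise it to 0.75.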

In a leveled fat-tree a switch at the top of an up
channel at level l is connected to itself at the top of
the corresponding down channel by a linear chain of
switches of length 2(m - l). A message may only make
a transition from an up channel to a down channel by
traversing a chain. Thus all shortest paths between
leaves in a leveled fat-tree have length 2m. Note that
the load of a set of messages on a channel of the leveled
fat-tree is identical to the load on the corresponding
channel in the fat-tree.

The path that a message for destination x in column 2m takes through a leveled fat-tree is determined by the m-universal hash function [4]

$$\mathrm{path}(x) = \Bigl(\Bigl(\sum_{j=0}^{m-1} a_j x^j\Bigr) \bmod P\Bigr) \bmod M,$$

where P is a prime number larger than the number of possible different destinations, and the a_j ∈ Z_P are chosen at random off-line. A message with destination x follows up channels until it can reach x without using any more up channels. It then crosses over to a down channel via a chain, and follows down channels to x. Note that a message only passes through a channel if it must. Also, all messages with destination x that pass through channel c pass through switch (path(x) mod cap(c)) at the top of c and through switch (path(x) mod (cap(c)/2)) at the bottom of c.
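
The switch selection can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it takes the hash to be the standard Carter-Wegman polynomial form with m random coefficients in Z_P, and the final reduction modulo the root capacity M, the function names, and the parameter values are all choices made for illustration.

```python
import random


def make_path_hash(m, P, M, rng):
    """Return a random polynomial hash x -> ((sum a_j x^j) mod P) mod M.

    The m coefficients a_j are drawn uniformly from Z_P once, off-line.
    The final reduction mod M (root capacity) is an assumption of this sketch.
    """
    coeffs = [rng.randrange(P) for _ in range(m)]

    def path(x):
        value = 0
        for a in reversed(coeffs):      # Horner's rule, all arithmetic mod P
            value = (value * x + a) % P
        return value % M

    return path


def switches_on_channel(path, x, cap):
    """(top, bottom) switch labels used on a channel of capacity cap."""
    return path(x) % cap, path(x) % (cap // 2)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
path = make_path_hash(m=4, P=101, M=16, rng=rng)
top, bottom = switches_on_channel(path, x=42, cap=8)
print(bottom == top % 4)                # True
```

Because the capacities are powers of two, the bottom label is the top label reduced modulo cap(c)/2, which matches the label rule of Lemma 8.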

The  following  lemma  shows  that  we  can  use  the 
scheduling  algorithm  from  Section  3  to  route  messages 
in  a  fat-tree. 

Lemma 9  For any constant c_1, there is a constant c_2 such that the probability that the number of steps required to deliver a set Q of messages with load factor λ is more than c_2(λ + log M) is at most 1/M^(c_1).

Proof: The paths of the messages are first randomized using the universal hash function path. With high probability, the resulting congestion is c = O(λ + log M). Each message travels a distance of d = 2m = 2 log M. The messages are then scheduled using the algorithm from Section 3. ∎

Let us now consider the VLSI area requirements [23] of fat-trees. A fat-tree with root capacity M and Θ(M^2) processors has a layout with area O(M^2 log^2 M) that is obtained by embedding the fat-tree in the tree of meshes [10]. The nodes of the tree of meshes in this layout are separated by a distance of lg M in both the horizontal and vertical directions. Thus, the Θ(log M) space for the chain associated with each processor in the leveled fat-tree can be allocated without increasing the asymptotic area of the layout. (In fact, it is possible to attach a chain of size O(log^2 M) to each fat-tree node without increasing the area by more than a constant factor.) The leaves of the fat-tree are separated in the layout from each other by a distance of lg M in each direction. We can improve the density of processors without increasing the asymptotic area of the layout by connecting a lg M × lg M mesh of processors to each leaf. The resulting network has Θ(M^2 log^2 M) processors and area Θ(M^2 log^2 M). The N-processor network in this class has root capacity Θ(√N / log N), Θ(N / log^2 N) leaves, and area Θ(N).
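
The parameters of the N-processor network follow from the processor count: there are Θ(M^2) leaves, each carrying a lg M × lg M mesh. A short derivation, filling in the arithmetic under that reading:

```latex
\begin{align*}
N &= \Theta(M^2 \log^2 M)
   \quad\Longrightarrow\quad \log N = \Theta(\log M), \\
M &= \Theta\!\left(\frac{\sqrt{N}}{\log M}\right)
   = \Theta\!\left(\frac{\sqrt{N}}{\log N}\right), \\
\text{number of leaves} &= \Theta(M^2)
   = \Theta\!\left(\frac{N}{\log^2 N}\right), \\
\text{area} &= \Theta(M^2 \log^2 M) = \Theta(N).
\end{align*}
```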

The following theorem shows that this class of networks is area-universal.

Theorem 10  With high probability, an N-processor point-to-point fixed-connection network U of area Θ(N) can simulate in O(log N) steps each step of any shared-bus fixed-connection network B of area O(N).


Proof: The processors of the shared-bus network B are mapped to the processors of the area-universal network U off-line using a recursive decomposition technique as in [13]. In each step, a wire of B is simulated by routing messages between the processors that it connects. At each level of the recursion at most O(cap(c) · log N) wires connect the processors mapped below a channel c with the rest of the network. This property of the mapping ensures that the load factor of each set of messages used in the simulation of B is at most O(log N), so by Lemma 9 each such set can be delivered in O(log N) steps with high probability. At the bottom of the decomposition tree, an O(log N) × O(log N) region of the layout of B is mapped to each leaf of the fat-tree. The O(log N) × O(log N) mesh connected to the leaf in U simulates this region of B using standard mesh routing algorithms. ∎

The study of fat-tree routing algorithms that perform combining was motivated in part by an abstraction of the volume and area-universal networks called the distributed random-access machine (DRAM). A host of conservative algorithms for tree and graph problems for the exclusive-read exclusive-write (EREW) DRAM are presented in [14]. Recently we discovered conservative concurrent-read concurrent-write (CRCW) algorithms that require fewer steps for some of these problems. Until now, however, no efficient fat-tree routing algorithms that perform combining were known. The O(λ + log M)-step routing algorithm presented here fills the void.

Only slight modifications to the area-universal fat-tree are necessary to make it volume-universal [5]. The underlying structure of the volume-universal fat-tree is a complete 8-ary tree. Instead of doubling at each level, the channel capacities increase by a factor of 4. The tree has m levels, root capacity M = 2^(2m), and M^(3/2) = 2^(3m) leaves. The switches at the top of a channel at level l are labeled 0 through 4^l - 1. Switch k at level l is connected to switches k, k + 4^l, k + 2·4^l, and k + 3·4^l at level l + 1. A layout with volume O(M^(3/2) log^(3/2) M) for the fat-tree can be obtained by embedding it in the three-dimensional tree of meshes. As before, a chain of size O(log^(3/2) M) can be attached to each node of the fat-tree without increasing the asymptotic layout volume, and the density of processors can be improved by connecting a lg^(1/2) M × lg^(1/2) M × lg^(1/2) M mesh to each leaf.
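
The four-way connection rule and the size relations can be illustrated with a short Python sketch. The function names are hypothetical; the sketch assumes an m-level complete 8-ary tree whose channel capacities grow by a factor of 4 per level, as described above.

```python
def parents(k, level, fanout=4):
    """Switches at level + 1 adjacent to switch k at `level`.

    fanout=4 gives the volume-universal rule k, k + 4^l, k + 2*4^l, k + 3*4^l;
    fanout=2 gives the area-universal rule k, k + 2^l.
    """
    return [k + j * fanout ** level for j in range(fanout)]


def volume_fat_tree_size(m):
    """(root capacity M, number of leaves) for an m-level complete 8-ary tree."""
    return 4 ** m, 8 ** m


print(parents(3, 1))             # [3, 7, 11, 15]
M, leaves = volume_fat_tree_size(5)
print(M, leaves)                 # 1024 32768
print(leaves ** 2 == M ** 3)     # True, i.e. leaves = M^(3/2)
```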

5  Counterexamples to on-line algorithms

In this section we give examples where several on-line scheduling strategies do poorly. Based on these examples, we suspect that finding an on-line algorithm that can schedule any set of paths in O(c + d) steps using constant size queues will be a challenging task.

In the first example, we describe an N-node network in which a set of packets with maximum congestion and maximum distance O(1) requires Ω(log^2 N / log log N) steps to be delivered using the strategy of Section 3. This example does not contradict the results of Section 3, since the network has O(log^2 N) levels. However, it shows that reducing the maximum congestion and maximum distance below the number of levels will not necessarily improve the running time.

Figure 2: Example 1.

Observation 11  For the strategy of Section 3, there is an N-node directed acyclic network of degree 3 and a set of paths with maximum congestion c = 3 and maximum distance d = 3 where the expected length of the schedule is Ω(log^2 N / log log N).

Proof: The network consists of many disjoint copies of the subnetwork pictured in Figure 2. The subnetwork is composed of k / log k linear chains of length k, where k shall later be shown to be Θ(log N). The second node of each linear chain is connected to the second-to-last node of the previous chain by a diagonal edge. We assume that at the end of each edge there is a queue that can store 2 packets. Initially, the queue into the first node of each chain contains an end-of-stream (EOS) signal and one packet, and the queue into the second node contains two packets. A packet's destination is the last node in the previous chain. Each packet takes the diagonal edge to the previous chain and then the last edge in the chain. Thus, the length of the longest path is d = 3.

When the ranks r_1, ..., r_(3k/log k) of the packets p_1, ..., p_(3k/log k) are chosen so that r_i < r_(i+1) for 1 ≤ i < 3k / log k, packet p_(3k/log k) requires Θ(k^2 / log k) steps to reach its destination. The scenario unfolds as follows. Packets p_1 and p_2 take a diagonal edge in the first two steps. These packets cannot advance until the EOS reaches the end of the first chain, in step k. In the meantime, ghosts with ranks r_1, r_2, and r_3 travel down the second chain, but packet p_3 blocks an EOS signal from traveling down the chain. Packets p_4 and p_5 are waiting for this EOS signal. They cannot advance until step 2k. In this fashion, the delay is propagated down to packet p_(3k/log k).

A simple calculation reveals that the probability that r_i < r_(i+1) for 1 ≤ i < 3k / log k is 1/2^Θ(k). Thus, if we have 2^Θ(k) copies of the subnetwork, we expect the ranks of the packets to be sorted in one of them. For the total number of nodes in the network to be N, we need k = Θ(log N). In this case, we expect some packet to be delayed Ω(log^2 N / log log N) steps in one copy of the subnetwork. ∎
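
The "simple calculation" can be checked numerically. The sketch below assumes the n = 3k / log k ranks are distinct and that every ordering is equally likely, so the probability of the sorted order is 1/n!; the function name and the sample values of k are illustrative only.

```python
import math


def log2_inverse_sorted_probability(k):
    """log2 of 1 / Pr[ranks sorted] for n = 3k / log2(k) distinct random ranks.

    All n! orderings are assumed equally likely, so the probability is 1 / n!.
    """
    n = int(3 * k / math.log2(k))
    return math.lgamma(n + 1) / math.log(2)   # lgamma(n + 1) = ln(n!)


for k in (2 ** 10, 2 ** 16, 2 ** 20):
    ratio = log2_inverse_sorted_probability(k) / k
    print(k, round(ratio, 2))   # ratio stays bounded, consistent with 2^Θ(k)
```

Since log2(n!) is about n log2 n and n log2 n is about 3k, the exponent grows linearly in k, which is why 2^Θ(k) copies suffice.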

It is somewhat unfair to say that the optimal schedule for this example has length O(c + d) = O(1), since ghosts and EOS signals must travel a distance of Θ(log N). However, even if the EOS signals are replaced by packets with the appropriate ranks, the maximum distance is only O(log N), and thus the optimal schedule has length O(log N).

The second example is quite general. The following observation shows that for any deterministic strategy in which the order in which packets are chosen to pass through a switch is independent of the future paths of the packets, there is a network and a set of paths with maximum congestion c and maximum distance d in which the schedule produced has length cd. This observation covers strategies such as giving priority to the packet that has spent the most (or least) time waiting in queues, and giving priority to the packet that arrives first at a switch. The network has the disadvantage of having degree c and size c^d.

Observation 12  For any deterministic strategy in which the order in which packets are chosen to pass through a switch does not depend on the paths that the packets take after they pass through the switch, there is a network and a set of paths with congestion c and maximum distance d for which the schedule produced has length cd.

Proof: We construct the example for congestion c and maximum distance d, E(c, d), recursively. The network consists of c copies of the network for E(c, d - 1) feeding into a single edge e. For each copy of E(c, d - 1), the strategy schedules some packet to arrive at e last. We extend the path of this packet so that it traverses e in E(c, d). The maximum distance of the new set of paths is d and the congestion c. The length of the schedule, T(c, d), is given by the recurrence

$$T(c,d) = \begin{cases} T(c, d-1) + c, & d > 1 \\ c, & d = 1 \end{cases}$$

and has solution T(c, d) = cd. ∎
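
The recurrence can be unrolled directly. In the Python sketch below the function name is hypothetical and the base case T(c, 1) = c is taken as an assumption (c packets crossing one edge, one per step), which is the value consistent with the stated solution cd.

```python
def schedule_length(c, d):
    """T(c, d) from the recurrence: T(c, 1) = c, T(c, d) = T(c, d - 1) + c."""
    if d == 1:
        return c          # c packets cross a single edge one at a time
    return schedule_length(c, d - 1) + c


print(schedule_length(3, 4))   # 12
```

Each additional level of the construction adds c steps, because the c packets that arrive last from the c copies reach the new edge together and must cross it one at a time.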

The third and fourth examples show that simple scheduling strategies fail even in much smaller networks.

Observation 13  For the strategy in which the packet with the farthest distance left to travel or the farthest total distance to travel is given priority, there is an N-node network with diameter O(√N) and a set of paths with congestion O(√N) and maximum distance O(√N) for which the schedule produced has length Ω(N).

Proof: The network consists of √N/2 linear chains. Chain i is composed of √N + i nodes. It meets chain i - 1 at node i and at every second node after that up to node √N - i. Thus chains i and i - 1 share their i-th, (i + 2)-th, ..., (√N - i)-th nodes. Figure 3 shows the network for √N = 8.

Figure 3: Example 3.

Figure 4: Example 4.

We have √N packets starting in a queue at the first node of each chain. Each packet simply traverses its chain to the end. We assume that any queue has unlimited size.

Note that the packets in chain i have higher priority than those in chain i - 1 whenever they meet, since the chain i packets travel one farther than those in chain i - 1.

We claim that at every meeting point between chain i - 1 and chain i, the packets in chain i - 1 are delayed by all the packets in chain i. This implies the observation, since the packets in the first chain would be delayed by √N packets at each of √N/2 meeting points, resulting in a total delay of Ω(N).

We prove the claim by induction on chain number and the number of meeting points. It certainly holds for the last two chains, i.e., the packets of the last chain arrive at the single meeting point at the same time as those of the second-to-last chain and have higher priority. So we assume that it is true for chain i and wish to prove that it is true for chain i - 1.

At the first meeting point of the chains the packets arrive at the same time, since chain i has not met any other chain and the packets in chain i - 1 are not delayed by any packet to the left. Thus the packets in chain i - 1 are delayed. To finish, we assume that the packets in chain i - 1 have been delayed for the first j meeting points and claim that the chain i - 1 packets meet the chain i packets at the (j + 1)st meeting point, since chain i - 1's packets have been delayed in the intervening node. ∎

Observation 14  For the strategy of assigning each packet a random rank and giving priority to the packet with the lowest rank, there is an N-node network with diameter O(log N / log log N) and a set of paths with maximum distance d = O(log N / log log N) and congestion c = O(log N / log log N) where the expected length of the schedule is Ω((log N / log log N)^(3/2)).

Proof: In this example we assume that queues have unlimited capacity. We also assume, without loss of generality, that each node can only send a message on a single output edge.

The network again consists of many copies of a subnetwork.

We construct our subnetwork so that d = c = k / log k. The subnetwork consists of a linear chain of length d, with loops of length √d between adjacent nodes (see Figure 4). Of the d packets, the highest priority √d use the first √d loops as their path. The next highest priority √d packets use the linear chain for √d steps and then use √d - 1 loops as their path, and so on.

It is easily seen that the i-th group of √d packets delays the packets with lower priority by d - i√d steps. Thus the last packet experiences an Ω(d√d) = Ω((k / log k)^(3/2)) delay.

Once again we need the packets to be in some specific order, which can be shown to happen with high probability given enough copies of the subnetwork. As in Observation 11, it is not hard to show this requires k = Θ(log N). ∎
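
The Ω(d√d) bound comes from adding up the per-group delays d - i√d. The Python sketch below does this sum for perfect squares d = s^2; it assumes the delays caused by different groups simply add for the last packet, and the function name is hypothetical.

```python
def total_delay(s):
    """Sum of the per-group delays d - i*sqrt(d) for d = s*s and i = 1..s."""
    d = s * s
    return sum(d - i * s for i in range(1, s + 1))


for s in (4, 16, 64):
    d = s * s
    print(d, total_delay(s), (d * s - d) // 2)   # the two counts agree
```

In closed form the sum is (d√d - d)/2, which is Θ(d^(3/2)).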


6  Remarks 

The scheduling algorithm from Section 3 can be used as a subroutine in algorithms for sorting and emulating shared memory machines on bounded-degree networks. By using the algorithm in place of the routing algorithm in [21], it is possible to sort N packets in O(log N) steps on an N-node butterfly using constant size queues. (This observation has been made previously by Pippenger [17], Ranade [19], and Reif [20].) A shared memory machine with a large address space can be emulated by randomly hashing the memory locations to the nodes of a butterfly as in [6] and [18]. The hashing ensures that the congestion of the packets implementing each memory access step is small. The algorithm from Section 3 can be used to schedule the movements of these packets. A more complete description of these applications will be provided in the full paper.


Acknowledgments 

Thanks to Jon Greene, Johan Hastad, Charles Leiserson, Nick Pippenger, and Abhiram Ranade for helpful discussions. Thanks to Tom Cormen for producing the figures.

References 

[1] M. Ajtai, J. Komlos, and E. Szemeredi, "An O(N log N) sorting network," Proceedings of the 15th Annual ACM Symposium on the Theory of Computing, 1983, pp. 1-9.

[2] R. Aleliunas, "Randomized parallel communication," Proceedings of the ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, August 1982, pp. 60-72.

[3] K. Batcher, "Sorting networks and their applications," Proc. AFIPS Spring Joint Comput. Conf., 1968, Vol. 32, pp. 307-314.

[4] J. L. Carter and M. N. Wegman, "Universal classes of hash functions," Journal of Computer and System Sciences, Vol. 18, 1979, pp. 143-154.

[5] R. I. Greenberg and C. E. Leiserson, "Randomized routing on fat-trees," Advances in Computing Research, Vol. 5, Randomness and Computation, S. Micali, ed., JAI Press, Greenwich, CT, 1988, to appear.

[6] A. R. Karlin and E. Upfal, "Parallel hashing - an efficient implementation of shared memory," Proceedings of the 18th Annual ACM Symposium on the Theory of Computing, May 1986, pp. 160-168.

[7] D. Krizanc, S. Rajasekaran, and Th. Tsantilas, "Optimal routing algorithms for mesh-connected processor arrays," VLSI Algorithms and Architectures (AWOC 88), J. Reif, ed., Lecture Notes in Computer Science 319, 1988, pp. 411-422.

[8] M. Kunde, "Routing and sorting on mesh-connected arrays," VLSI Algorithms and Architectures (AWOC 88), J. Reif, ed., Lecture Notes in Computer Science 319, 1988, pp. 423-433.

[9] F. T. Leighton, F. Makedon, and I. Tollis, "A 2N - 2 step algorithm for routing in an N × N mesh," unpublished manuscript.

[10] F. T. Leighton, Complexity Issues in VLSI, MIT Press, Cambridge, MA, 1983.

[11] F. T. Leighton, "Tight bounds on the complexity of parallel sorting," IEEE Transactions on Computers, Vol. C-34, No. 4, April 1985, pp. 344-354.

[12] T. Leighton and S. Rao, "An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms," Proceedings of the 29th Annual Symposium on Foundations of Computer Science, IEEE, 1988, to appear.

[13] C. E. Leiserson, "Fat-trees: universal networks for hardware-efficient supercomputing," IEEE Transactions on Computers, Vol. C-34, No. 10, October 1985, pp. 892-901.

[14] C. E. Leiserson and B. M. Maggs, "Communication-efficient parallel graph algorithms for distributed random-access machines," Algorithmica, Vol. 3, pp. 53-77, 1988.

[15] R. Miller, V. K. Prasanna-Kumar, D. Reisis, and Q. F. Stout, "Meshes with reconfigurable buses," Advanced Research in VLSI: Proceedings of the Fifth MIT Conference, J. Allen and F. T. Leighton, eds., MIT Press, Cambridge, MA, 1988, pp. 163-178.

[16] J. K. Park, "A deterministic routing algorithm for the butterfly-fat-tree," unpublished manuscript.

[17] N. Pippenger, "Parallel communication with limited buffers," Proceedings of the 25th Annual Symposium on Foundations of Computer Science, IEEE, 1984, pp. 127-136.

[18] A. G. Ranade, "How to emulate shared memory," Proceedings of the 28th Annual Symposium on Foundations of Computer Science, IEEE, October 1987, pp. 185-194.

[19] A. G. Ranade, personal communication.

[20] J. H. Reif, personal communication.

[21] J. H. Reif and L. G. Valiant, "A logarithmic time sort for linear size networks," Journal of the Association for Computing Machinery, Vol. 34, No. 1, January 1987, pp. 60-76.

[22] J. Spencer, Ten Lectures on the Probabilistic Method, SIAM, Philadelphia, PA, 1987.

[23] C. D. Thompson, A Complexity Theory for VLSI, Ph.D. thesis, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1980.

[24] E. Upfal, "Efficient schemes for parallel communication," Proceedings of the ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, August 1982, pp. 55-59.

[25] L. G. Valiant and G. J. Brebner, "Universal schemes for parallel communication," Proceedings of the 13th Annual ACM Symposium on the Theory of Computing, May 1981, pp. 263-277.


